Template-Based Information Mining from HTML Documents

Authors

  • Jane Yung-jen Hsu
  • Wen-tau Yih
Abstract

Tools for mining information from data can create added value for the Internet. As the majority of electronic documents available over the network are in unstructured textual form, extracting useful information from a document usually involves information retrieval techniques or manual processing. This paper presents a novel approach to mining information from HTML documents using tree-structured templates. In addition to syntactic and semantic descriptions, each template is designed to capture the logical structure of a class of documents. Experiments have been conducted to extract FAQ information automatically from over one hundred HTML documents collected from the Web. Using two basic templates, the prototype FAQ Miner has accurately analyzed 65% of the collection of FAQ documents. With additional processing to handle “near-pass” cases, the success rate is approximately 75%. The preliminary results have demonstrated the utility of structural templates for mining information from semi-structured text-based documents.
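
The abstract does not show the template language used by FAQ Miner, so the following is only an illustrative sketch of the general idea of matching a tree-structured template against an HTML document. It assumes a FAQ laid out as a definition list (`<dl>` with alternating `<dt>` questions and `<dd>` answers). The names `FAQ_TEMPLATE`, `TreeBuilder`, `match`, and `mine` are hypothetical and do not come from the paper.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not become the "current" parent.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One HTML element; children hold nested Node objects and raw text strings."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple element tree (assumes reasonably well-nested HTML)."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def raw_text(node):
    """Concatenate all text below a node, in document order."""
    return "".join(c if isinstance(c, str) else raw_text(c) for c in node.children)


# Hypothetical template: a <dl> whose element children repeat the pattern (dt, dd).
FAQ_TEMPLATE = {"tag": "dl", "repeat": [("dt", "question"), ("dd", "answer")]}


def match(node, template):
    """Return a list of extracted records if the node fits the template, else None."""
    if node.tag != template["tag"]:
        return None
    elems = [c for c in node.children if isinstance(c, Node)]
    pattern = template["repeat"]
    if not elems or len(elems) % len(pattern) != 0:
        return None
    records = []
    for i in range(0, len(elems), len(pattern)):
        record = {}
        for child, (tag, field) in zip(elems[i:i + len(pattern)], pattern):
            if child.tag != tag:
                return None
            record[field] = " ".join(raw_text(child).split())
        records.append(record)
    return records


def mine(html, template):
    """Depth-first search for the first subtree that matches the template."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    stack = [builder.root]
    while stack:
        node = stack.pop()
        records = match(node, template)
        if records is not None:
            return records
        stack.extend(c for c in reversed(node.children) if isinstance(c, Node))
    return []
```

Calling `mine(page_source, FAQ_TEMPLATE)` returns a list of `{"question": ..., "answer": ...}` dictionaries, or an empty list when no subtree fits. A strict matcher like this one rejects any document that deviates from the pattern, which is one way to picture why a separate pass for near-matching documents could raise the success rate.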

Similar resources

Web Content Extraction to Facilitate Web Mining

Internet continuously strives to become the prime source of knowledge and Information, used in almost every sphere of life. As the volume and complexity of the Information shared on WEB is increasing, various forms of representation of this data has been emerged. In order to deal with different forms of data, different technologies have been discovered to efficiently provide the Information to ...

Efficient Algorithm for Mining on Bio Medical Data for Ranking the Web Pages

Information in the internet is evolving in terms of high volume through different sources. Extracting tuples from HTML pages has been an important issue in various web applications such as web data integration, e-commerce market monitoring, and mash ups that repurpose and selectively combine existing web data services. Data Mining is the process of analyzing data from different perspectives and...

A ME Model Based on Feature Template for Chinese Text Categorization

With entering into information society and the Internet developing rapidly, people could acquire more and more information. How to utilize Internet information efficiently and promptly, has became a hotspot in information technology. Text categorization is an important component to help getting useful message from tremendous amount of vast information. And it assigns new documents to pre-define...

Publication date: 1997